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ABSTRACT With the advances in high-throughput genotyping technology, the study of quantitative trait 
loci (QTL) has emerged as a promising tool to understand the genetic basis of complex traits. Methodology 
development for the study of QTL recently has attracted significant research attention. Local phylogeny- 
based methods have been demonstrated to be powerful tools for uncovering significant associations 
between phenotypes and single-nucleotide polymorphism markers. However, most existing methods are 
designed for homozygous genotypes, and a separate haplotype reconstruction step is often needed to 
resolve heterozygous genotypes. This approach has limited power to detect nonadditive genetic effects 
and imposes an extensive computational burden. In this article, we propose a new method, HTreeQA, that 
uses a tristate semi-perfect phylogeny tree to approximate the perfect phylogeny used in existing methods. 
The semi-perfect phylogeny trees are used as high-level markers for association study. HTreeQA uses the 
genotype data as direct input without phasing. HTreeQA can handle complex local population structures. It 
is suitable for QTL mapping on any mouse populations, including the incipient Collaborative Cross lines. 
Applied HTreeQA, significant QTLs are found for two phenotypes of the PreCC lines, white head spot and 
running distance at day 5/6. These findings are consistent with known genes and QTL discovered in 
independent studies. Simulation studies under three different genetic models show that HTreeQA can 
detect a wider range of genetic effects and is more efficient than existing phylogeny-based approaches. We 
also provide rigorous theoretical analysis to show that HTreeQA has a lower error rate than alternative 
methods. 
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The goal of quantitative trait locus (QTL) mapping is to find strong 
associations representing (genomically proximal) causal genetic effects 
between observed quantitative traits and genetic variations. There are 
several mouse resources such as the Collaborative Cross (CC) (The 
Complex Trait Consortium 2004; Collaborative Cross Consortium 
2012), Heterogeneous Stock (Valdar et al 2006), and Diversity Out- 
bred (Collaborative Cross Consortium 2012; Svenson et al 2012) for 
large-scale association study of complex traits, among which the CC 
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captures the most genetic and phenotypic diversity (Roberts et al 
2007; Aylor et al 2011). 

Many previous QTL mapping methods consider each genetic 
marker independently (Akey et al 2001; Thomas 2004; Pe'er et al 
2006). Standard statistical tests (such as the F-test) are used to mea- 
sure the significance of association between a phenotype and every 
single nucleotide polymorphism (SNP) in the genome. These single 
marker— based methods usually do not consider the effects of (both 
genotyped and ungenotyped) neighboring markers and hence may fail 
to discover QTL for complex traits. To address this limitation, cluster- 
based methods, such as HAM (Mcclurg et al 2006), QHPM (Onkamo 
et al 2002), and HapMiner (Li and Jiang 2005), have been developed. 
Typically the genome is partitioned into a series of intervals. For each 
interval, these methods first cluster samples based on the genotypes 
within it and then assess the statistical correlation between the clusters 
and the phenotype of interest. The result is sensitive to the granularity 
of the partition, the definition of genotype similarity, and the choice of 
clustering algorithms. More importantly, these methods tend to em- 
phasize mutations as the major events that cause the differences in the 
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DNA sequences of the samples. This may not fully represent the 
genetic background underlying the differences. 

Phylogeny trees have been widely used to model evolutionary 
history among different species, subspecies, or strains (Yang et al. 
2011). Their application in association study requires inferring an 
accurate global phylogeny tree from the DNA sequences (Larribe 
et al. 2002; Morris et al. 2002; Minichiello and Durbin 2006). This 
may not be feasible for the high- density markers in current QTL 
analysis. Some recent methods, such as Genomic Control (Devlin 
and Roeder 1999), EIGENSTRAT (Price et al 2006), and EMMA 
(Kang et al 2008), build global models to account for genetic effects. 
EMMA computes a kinship matrix to correct the effect of the pop- 
ulation structure. Genomic Control estimates an inflation factor of the 
test statistics to account for the inflation problem caused by unbal- 
anced population structure. EIGENSTRAT performs an orthogonal 
transformation on the genotypes using principal component analysis 
and then conducts the association study in this transformed space. 
However, the genetic background of the samples may not always be 
adequately captured by a global model. This is particularly true for the 
incipient Collaborative Cross population (PreCC). There is no signif- 
icant global population stratification among the PreCC lines because 
each of the eight founders contributes roughly one- eighth of their 
entire genome (Aylor et al 2011). This unique design removes the 
need for global population structure correction in QTL mapping. 

However, local population structures may still exist. Because of the 
limited number of recombinations occurred since the founder genera- 
tion, the genome of each CC line is a coarse mosaic of composed 
segments from the eight founders. In a genomic region, a CC line 
may be determined by the contribution from a single founder and none 
from the rest. Because the eight founders are from three subspecies, 
local population structure may exist in these CC lines. We have ob- 
served uneven genetic background at the chromosome level in the 184 
genotyped PreCC lines, and such pattern becomes stronger when we 
examine at finer resolutions. (Please see Results and Discussion for 
further discussion of the local population structure in the PreCC lines.) 

Local phylogeny becomes a natural choice for capturing this type 
of effect. Several recent methods [e.g., TreeLD (Zollner and Pritchard 
2005), TreeDT (Sevon et al. 2006), BLOSSOC (Mailund et al. 2006; 
Besenbacher et al. 2009), and TreeQA (Pan et al. 2008, 2009)] have 
adopted local perfect phylogeny trees to model the genetic distance 
between samples. These methods examine possible groupings induced 
by each local phylogeny and report the ones showing strong statistical 
associations with the phenotype. Because these methods require a large 
number of statistical tests and their results are often corrected by large 
permutation tests, they are prone to multiple testing errors and incur 
significant computational burden. TreeLD and TreeDT can handle 
only a very small number of SNP markers and thus they are not 
suitable for large-scale QTL mapping. BLOSSOC is more efficient 
and can process the entire genome but still needs days to perform 
a large number of permutation tests. The recently proposed TreeQA 
algorithm uses several effective pruning techniques to reduce compu- 
tational burden and is able to finish large permutation tests in a few 
hours. 

A common limitation shared by all of these local phylogeny-based 
methods is that the perfect phylogeny trees can be only constructed 
from haplotypes. These methods either assume that samples are 
purebred {i.e., no heterozygosity), which is not true for many large 
mammalian resources, including the PreCC lines, or that a preprocess- 
ing step phases each genotype into a pair of haplotypes. However, 
haplotype reconstruction itself is a nontrivial process that is both 
time-consuming (Scheet and Stephens 2006) and error-prone (Ding 



et al. 2008). Even if haplotypes are phased accurately, the two hap- 
lotypes of the same sample may be located at different branches of 
a phylogeny tree and will be treated as if they were independent 
samples in subsequent statistical tests. This may create a bias favoring 
additive effects and lead to spurious results. For example, consider 
a recessive phenotype, we use Ala to represent the majority and mi- 
nority alleles at the causative locus. The local phylogeny tree built 
from the surrounding region has an edge corresponding to the caus- 
ative SNP that separates the samples into two groups carrying A and 
a alleles, respectively. Each heterozygous Ala sample is phased into 
two haplotypes, each belonging to a different group. The group having 
allele a would have mixed phenotypes. This may weaken the power of 
any statistical tests and fail to detect the causative edge (Wang and 
Sheffield 2005, Lettre et al. 2007). The scenario may become even 
worse for phenotypes having overdominant effects on heterozygous 
samples. 

Therefore, a natural question to ask is whether we can design a 
phylogeny-based QTL mapping that can be applied to unphased 
genotypes directly. In this article, we introduce the model of tristate 
semi-perfect phylogeny tree directly built from unphased genotype 
data and explore its utility in QTL study. Our method, HTreeQA, has 
the advantages of phylogeny-based methods but does not require 
a separate phasing step. We demonstrate via simulation studies that 
HTreeQA can detect a wider range of genetic effects than other 
alternative methods. 

MATERIALS 
Collaborative Cross 

We use the genotypes of 184 partially inbred mice from the CC lines 
(Aylor et al. 2011). On average, these mice have undergone 6.7 gen- 
erations of inbreeding and have 16% heterozygosity. The genotypes at 
approximately 180K SNPs are collected using the mouse diversity 
array (Yang et al. 2009). The data can be accessed through the CC 
status website (http://csbio.unc.edu/CCstatus/index.py). We study two 
phenotypes. One is the white head spot, which was originally observed 
on one of the CC founders, WSB/EiJ. Because there are no white 
head- spotted mice found in Fl crosses of the CC founders, the phe- 
notype is believed to be a recessive trait. Among the 184 mice, there 
are four with white head spot. Another phenotype we study is the 
average daily running distance for mice of 5 to 6 days old. This is 
a typical measurement for mouse activity. The phentotypes are sup- 
plied as supporting information, File SI. 

Synthetic data sets 

The phenotype was simulated using three different models of genetic 
effects: additive, recessive, and overdominant (a special case of epistasis 
effect) models. We include the overdominant model because we observe 
that heterozygous individuals sometimes exhibit extreme phenotypes. 
This phenomenon cannot be captured by an additive or recessive model. 

To simulate phenotypes, we adopt the method used in Long and 
Langley 1999. To simulate an additive phenotype for a given SNP, we 
use the following formula: 

where it is the percentage of the variation attributable to the quan- 
titative trait nucleotide, iV(0, 1) is the standard normal distribution, 
and p is the minor allele frequency. In the additive model, Q z takes 
values —1,0, and 1 for homozygous wild- type, heterozygous type, or 
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homozygous type, respectively. For recessive and overdominant 
models, we use 

„ = ^ N( o,i) + Q;.y=^, 

where p' is the fraction of individuals that are homozygous mutants. 
In a recessive model, Q- is 1 for homozygous mutant and 0 otherwise. 
In an overdominant model, Q { takes 1 for heterozygous mutant and 
0 otherwise. All causative SNPs are removed from the genotypes 
before analysis. We represent results of a wide range of realistic 
contributions of genetic variations by testing five genetic variation 
settings of tt: 0.05, 0.1, 0.15, 0.2, and 0.25. 

We simulated genotypes of 170 independent individuals. Under 
each genetic effect model, we generated 100 independent test cases 
under each setting. In each case, there are 10,000 SNPs and one 
causative SNP is randomly picked among the SNPs with minor allele 
frequency greater than 0.15. 

METHODS 
Notations 

We follow the convention of using primed notation for unphased 
genotype data. Suppose that there are m individuals and n SNPs. We 
use {S[ , S2, . . . , S' n } to represent the unphased SNPs and {Si, S 2 , . . ., S n } 
to represent the phased SNPs. The unphased genotypes can be repre- 
sented as an m x n matrix M\ where the k-th row corresponds to the 
genotype of the k-th individual and the /-th column corresponds to the 
/-th SNP marker S' t Similarly, the 2m haplotypes can be represented as 
a 2m x n matrix M, where the 2/c-th and (2k + l)-th rows correspond 

Table 1 An example of unphased data (A) # its phased data (B) # and its transformed result (C) 



A. The unphased haplotype matrix 



Sample ID S\ 


Si 


Si 


S4 


Si 


Phenotype 










A 0 


0 


1 


1 


0 


10 










B 0 


0 


1 


0 


1 


10 










C H 


1 


0 


0 


0 


2 










D H 


H 


0 


0 


0 


10 










E 1 


1 


0 


0 


0 


2 










B. The phased haplotype matrix 


Haplotype ID Si 


s 2 


S3 


s 4 


s 5 


Phenotype 










A1 0 


0 


1 


1 


0 


10 










A2 0 


0 


1 


1 


0 


10 










B1 0 


0 


1 


0 


1 


10 










B2 0 


0 


1 


0 


1 


10 










C1 0 


1 


0 


0 


0 


2 










C2 1 


1 


0 


0 


0 


2 










D1 0 


0 


0 


0 


0 


10 










D2 1 


1 


0 


0 


0 


10 










E1 1 


1 


0 


0 


0 


2 










E2 1 


1 


0 


0 


0 


2 










C. The transformed genotype matrix 


ID Si'(0) 


Si'O) 


S{(H) 


Si(0) 


SO) 


Si(H) 


S 3 '(0) 


SsO) 


S 3 (H) S4(0) 


S4(1) S4(H) Si(G) S-K1) Si(H) 


A 0 


0 


0 


0 


0 


0 


1 


1 


0 1 


1 0 0 0 0 


B 0 


0 


0 


0 


0 


0 


1 


1 


0 0 


0 0 110 


C 1 


0 


1 


1 


1 


0 


0 


0 


0 0 


0 0 0 0 0 


D 1 


0 


1 


1 


0 


1 


0 


0 


0 0 


0 0 0 0 0 


E 1 


1 


0 


1 


1 


0 


0 


0 


0 0 


0 0 0 0 0 


Bold columns are selected for build 


ng the tristate semi-perfect phylogeny tree. 
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A1(10),A2(10) _ B1(10),B2(10) 

05 



B 




A(10) B(10) 

0 4 



S 2 

D2(10),C2(2),E1(2),E2(2) 




Si 

E(2)-C(2) 





C,D,E 



Figure 1 (A) is the perfect phylogeny tree generated on the phased 
haplotypes in Table 1B. Each node is labeled by its haplotype ID, 
followed by the corresponding phenotype value. (B) is a tristate 
semi-perfect phylogeny tree generated on the unphased genotypes 
in Table 1A. Each node is labeled by its sample ID followed by the 
corresponding phenotype value. (C) is the corresponding perfect phy- 
logeny tree by deleting S{ and Si in Table 1 A, and (D) is the correspond- 
ing perfect phylogeny tree by deleting samples C and D in Table 1 A. 

to the haplotypes of the k-th individual. In the haplotype matrix M y we 
use 0 and 1 to represent the major allele and the minor allele of a SNP 
respectively. In the genotype matrix 31', we use 0, 1, and H to represent 
the homozygous major allele, the homozygous minor allele, and the 



For each markers in 
the genome, set s as an 
initial region R. 



Yes 



Expand Region R by 
adding its neighboring 
marker s. 
Set R'=RU{s}. 




Use Integer Programming method to 

generate tri-state semi-perfect 
phylogeny P for region R. Save P into T. 



No 




> For each phylogeny P in T <■ 



For each possible partition K in P < 



No 



ANOVA Test on 
K and Penotype. 




Yes 



Permutation Test on 
K and Phenotype. 



T 



If current P-value is smaller than the 
phylogeny P's P-value, set the phylogeny 
P's P-value as the current P-value. 



Yes 



Use the method described in Permutation Test for Family- 
Wise Error Rate Controlling subsection to get the threshold 
of 5% family-wise error rate for the nominal P-values. 




Figure 2 The workflow of 
HTreeQA. The inputs are the 
genotype and phenotype data. 
The output is a list of phyloge- 
nies and their P-values for mea- 
suring the association with the 
phenotype, and a threshold of 
P-value representing the 5% 
FWER. 



heterozygous allele of a SNP, respectively. Table 1A shows an unphased 
genotype matrix, and Table IB shows a phased haplotype matrix. 

Perfect phylogeny tree 

An interval along the genome consists of a set of consecutive SNPs. It 
corresponds to a submatrix C MjV ( M) of M that contains all columns 
between the u-th column and the v-th column. A perfect phylogeny 
tree is the tree representation of the evolution genealogy for an interval 
in the genome (Gusfield 1991). 

Definition 1: Given an interval Cu,v(M) of 2m haplotypes and n 
SNPs, a perfect phylogeny tree is a tree, in which the haplotype sequen- 
ces are the leaves and SNPs are the edges. Given an allele of any SNP, 
the subgraph induced by all the nodes that carry the same allele is still 
a connected subtree. 

The perfect phylogeny can be treated as an evolutionary history for 
the interval. Each edge represents the mutation event that derives two 
alleles of the corresponding SNP. All the haplotypes can be explained 
by the the evolutionary history without any recombination event. For 
example, Figure 1A shows the perfect phylogeny tree built from the 
haplotypes in Table IB. 

Compatible interval 

An interval C UtV (M) is a compatible interval if every pair of SNP 
markers in the interval pass the four-gamete test (Hudson and 
Kaplan 1985). That is, at most three of the four possible allele pairs 
{00, 01, 10, 11} appear in each pair of SNPs in the interval. This 
implies the existence of an evolution genealogy that can explain the 
evolutionary history of these two markers without recombination 
events, given the assumption of an infinite site model (i.e., no 
homoplasy). For a given interval, a perfect phylogeny exists if 
and only if the interval is a compatible interval. If a compatible 
interval is not a subinterval of another compatible interval, it is 
called a maximal compatible interval. 



Tristate semi-perfect phylogeny tree 

The multistate perfect phylogeny tree (Gusfield 2010) is a natural 
extension of the perfect phylogeny tree discussed previously. It was 
originally proposed to model the rare events having multiple muta- 
tions at a single locus. Because the perfect phylogeny cannot handle 
heterozygous site properly, we propose a novel utility of the multistate 
phylogeny in modeling heterozygosity in QTL mapping. By treating 
the heterozygous allele as the third status, a tristate phylogeny tree can 
be generated from a set of unphased genotypes. Because this third 
state is not a result of a single mutation, the tristate phylogeny tree is 
a relaxation of a perfect phylogeny tree. 

Definition 2: Given an interval C UjV (M') of m genotypes and n SNPs, 
a tristate semi-perfect phylogeny tree is a tree in which the genotype 
sequences are the leaves and SNPs are the edges. A SNP corresponds to 
an edge if only two of the three possible alleles are observed and cor- 
responds to two edges if all three alleles are observed. Given an allele of 
any SNP, the subgraph induced by all the nodes that carry the same 
allele is still a connected subtree. 

Compatibility test on genotype data 

Given an interval C uy ( !M) in the genotype matrix, we construct a bi- 
nary matrix C UjV (M'). Each column S- in C UjV (M) corresponds to 
three binary columns S?(0), S?(l), and S f '(H) in C UiV (M'). S?(0) is 
generated from S\ by replacing every 'H' in S[ by T. S-(l) is generated 
from S\ by replacing every 'H' in S\ by '0'. S\(H) is generated from S[ 
by replacing every 'H' in S[ by '1' and '0' and T in S[ by '0.' This is 
equivalent to representing the '0,"l,'and 'H' alleles in the heterozygous 
S[ by triplets (0,0,0), (1,1,0), and (1,0,1), respectively. For example, 
Table 1C shows the generated binary matrix C UjV (M) for the geno- 
type matrix C U>V (M) in Table 1A. Note that all states in C U ^ V (M) are 
identical to that in C U}V (M f ) except the 'H' alleles and S'(H) columns. 
Given an interval, the following theorem states the necessary and 
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Figure 3 Four phylogenies of 43 randomly selected (from a total of 1 84) PreCC mice. The sum of the edge depth between a leaf and the origin represents the 
genetic distance of the corresponding mouse from the common ancestry of the 43 mice. The mice with white head spot are highlighted in red. Their nearest 
common ancestor is indicated by a circled "A" in each figure. In (A), the global phylogeny is balanced, and all mice are almost equally distant from each other. 
The phylogenies in (B) and (C) are no longer balanced, with several deep branches. The local population structure is a confounding factor that complexes the 
QTL analysis. The tristate semi-perfect phylogeny in (D) has the simplest structure, with an informative branch that contains all four white spot mice. 



sufficient condition for the existence of a tri-state semi-perfect phy- 
logeny (Dress and Steel 1992). 

Theorem 1: Given an interval C UiV (M') in the genotype matrix, 
there exists a tristate semi-perfect phylogeny, if and only if there 
exists a submatrix S formed by selecting two of the three columns in 
C uy (M r ) for each SNP marker, and any pair of columns in S pass 
the four -gamete test. 



An integer linear programming approach (Gusfield 2010) can be 
used to determine whether an interval is compatible and to compute 
the submatrix S. For example, in the matrix C U ^(M') shown in Table 
1C, the columns selected for S are boldface. Once S is computed, 
a tristate semi-perfect phylogeny tree can be constructed by applying 
any standard perfect phylogeny tree algorithm on S. For example, 
Figure IB shows the tristate semi-perfect phylogeny tree constructed 
from the matrix S in Table 1C. 
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Figure 4 Three kinship matrices represent the genetic relatedness over the entire genome between any pair of the 184 CC mice based on the 
whole genome (A), the chromosome 10 (B), and the 20-Mbps interval in Chromosome 10 (C) respectively. The mice are arranged in the same 
order in both x and y axes. In (A), all off-diagonal entries have almost identical values, suggesting that there is no global population structure. In (B) 
and (C), the mice are arranged in the order of their genetic relatedness, genetically similar mice are near each other. 



If there is no heterozygous allele, each genotype will be composed 
of two identical haplotypes; the tristate semi-perfect phylogeny tree is 
identical to the perfect phylogeny tree constructed on the haplotypes. If 
there are some heterozygous genotypes, removing the rows or columns 
in the matrix containing the heterozygous alleles does not affect the 
remaining part of the phylogeny tree. The tree in Figure 1C shows 
the perfect phylogeny tree constructed on S3 , S4 , S5 in Table 1 A, 
which can also be derived by collapsing the three edges labeled by 
S[ or Si in Figure IB. If we remove nodes C and D (that have 
heterozygous genotypes) in Figure IB, the resulting tree is also iden- 
tical to the perfect phylogeny tree constructed on A, B, E (Figure 
ID). We observe that any heterozygosity only introduces local var- 
iations in a phylogeny tree. 

Another important observation can be made by comparing the 
perfect phylogeny tree constructed on the haplotypes to the genotype 
matrix. When the genotype matrix contains a small percentage of 
heterozygosity, the tristate semi-perfect phylogeny tree shares a sub- 
stantial common structure with the perfect phylogeny tree on the 
haplotypes. Figure 1A shows the perfect phylogeny tree constructed 
on the haplotypes in Table IB. Note that the two haplotypes (e.g., Dl, 
D2) of the same genotype (e.g. D) may be associated with different 
nodes in the tree. We will show later that this decoupling will weaken 
the power of detecting nonadditive genetic effects. However, this tree 
shares common induced subtrees with the tristate semi-perfect phy- 
logeny tree. Removing the nodes associated with the decoupled hap- 
lotypes will result in Figure ID, whereas collapsing edges connecting 
these nodes will result in Figure 1C. 

Phylogeny tree -based test 

An edge in a phylogeny tree connects two disjoint subtrees. Removing 
x edges partitions the tree into x + 1 subtrees. For example, removing 
the two edges labeled with S[ and Si in Figure IB partitions genotypes 
into three groups {A, B, D}, {C}, and {E}. 

The statistical correlation between a partition and the phenotype can 
be examined by the F-statistics. Assuming that for a total of t individuals, 
we havep groups, and the zth group contains f f individuals. We use Xy to 
represent the ith element in the jth group, Xj to represent the mean of 
the jth group, and X to represent the overall mean value. Given such 
a grouping of phenotype values, G, the F-statistics is denned as 



F(G) = 



Xi-X) 



Xij Xj 



distribution, an F-test is applied to obtain the corresponding P- value. 
Otherwise, a permutation test is needed. The P-value is denned as 
n p erm where nPerm is the number of permutations and n is the num- 
ber of times when the F-statistics of the permuted phenotype is larger 
than F(G). 

We examine all possible partitions generated by removing edges in 
the tree. The partition that generates the most significant P-value is 
reported. The corresponding P-value is used as the nominal (uncor- 
rected) P-value of the association between the compatible interval and 
the phenotype. 

Permutation test for family-wise error rate 
(FWER) controlling 

Appropriate multiple testing correction is crucial for QTL studies. In 
HTreeQA, we apply the widely used permutation test to control 
family- wise error rate (Westfall and Young 1993; Churchill and Doerge 
1994). In each permutation, the phenotype values are randomly shuf- 
fled and reassigned to individuals. For each permuted phenotype, 
we repeat the previously described procedure and find the smallest 
P-value. The corrected P-value is the proportion of the permuted data 
whose P-values are more significant than that of the original data. We 
refer to such a corrected P-value as the permutation P-value. The basic 
routine of HTreeQA is summarized in Figure 2. 

Comparison between TreeQA and HTreeQA 

We outline two alternative approaches for local phylogeny-based QTL 
mapping methods and discuss their pros and cons. 

• HTreeQA: We compute compatible intervals by using integer lin- 
ear programming and construct a tristate semi-perfect phylogeny 
tree for each compatible interval. Then we follow the procedure 
described above to find significant associations. 

• Running TreeQA on phased data: We first phase the genotypes 
using any standard phasing algorithm and then apply TreeQA on 
the resulting haplotypes. Each haplotype is assumed to have the 
same phenotype value as the original genotype. 

The second approach has an inherent drawback. It decouples the 
two haplotypes of the same genotype. As a result, the two haplotypes 
may reside in remote branches of the tree, which limits the ability to 



(1) Table 2 Selected methods for comparison 



Methods 



The corresponding P-value of F(G) can be calculated in the follow- 
ing way. If the phenotype values from each group follow a normal 



Nonphylogeny-based methods 
Phylogeny-based methods 



SMA, HAM, EMMA 
BLOSSOC, TreeQA, HTreeQA 
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Figure 5 QTL mapping of the white head spot phenotype. Only the SNPs that have top 0.5% -log(p-value) or BLOSSOC score are plotted. One 
QTL is detected by HTreeQA, which is near the location of gene kit ligand. The remaining methods except HAM have similar results to that of 
HTreeQA. The dashed line is the significance level with FWER = 0.05. (A) Result from HTreeQA. (B) Result from TreeQA. (C) Result from EMMA. 
(D) Result from BLOSSOC. (E) Result from HAM. (F) Result from SMA. 



test certain genetic effects in QTL mapping. For example, the 
phenotype in Table 1A follows a recessive model denned on Si : 
the phenotype is 2 for samples (C, E) having minor allele (T) and 
is 10 for the remaining samples A, B, D (with alleles '0' or 'H'). There 
does not exist a set of edges in Figure 1A that can perfectly separate 
these two groups. (The haplotype D2 will always be in the same group 
as CI, El, E2.) In contrast, the tristate semi-perfect phylogeny tree has 
an edge Si that perfectly separates A, B, and D from C, E. Therefore, 
the tristate semi-perfect phylogeny tree is more suitable for handling 
heterozygosity in association studies. We provide a theoretical com- 
parison of these two approaches in Appendix 1. 

RESULTS AND DISCUSSION 
Population structure in the PreCC lines 

Population stratification is an important issue in QTL analysis. 
Spurious associations may be induced by the stratification if it is 
not addressed properly (Kang et al. 2008). The combinatorial breeding 
design of the CC yields genetically independent incipient CC lines, 
which ensures balanced contributions of all eight founder strains 
without noticeable global population stratification (Aylor et al. 
2011). Figure 3 A shows a global phylogeny tree of 43 randomly se- 
lected PreCC lines. The balanced tree structure illustrates that these 
mice are genetically diverse and equally distant from each other. This 
observation is further confirmed by the kinship matrix in Figure 4A 
used by EMMA for modeling genetic background (Kang et al. 2008). 
In Figure 4A, each row (column) of the kinship matrix corresponds to 
a CC strain. Each entry in the matrix is the kinship coefficient that 
represents the genetic relatedness between the two mice. We can 



observe that all off-diagonal entries in Figure 4A have almost identical 
values (around 0.8), which suggests that no significant global popula- 
tion stratification exists in these PreCC mice. (In Appendix 2, we 
provide a statistical analysis that EMMA degenerates to a standard 
linear model when applied to the CC lines.) 

Although the genome of each CC line receives a balanced con- 
tribution from each founder strain, the founder contribution is not 
uniformly distributed along the genome because of the small number of 
recombination events undergone by each CC line. The genome of a CC 
line is essentially a mosaic of a small number of founder haplotype 
segments. On average, Pre-CC autosomal genomes had 142.3 segments 
on average (SD = 21.8) with a median segment length of 10.46 Mb 
(Aylor et al. 2011). As a result, some local subpopulation structure may 
be observed because the eight founder strains are not equally distant 
from each other (i.e., three of founders are wild strains). The subpopu- 
lation structure is visible at the chromosome level. For example, there are 
several deep branches in the phylogeny tree of the selected PreCC mice 
built on Chromosome 10 (Figure 3B). The corresponding kinship matrix 
in Figure 4B shows that there are at least three subpopulations. The 
subpopulation structure is more evident if we narrow down to a 20 Mbps 
interval from 85 Mbps to 105 Mbps on Chromosome 10. The phylogeny 
tree in Figure 3C becomes more skewed, and the corresponding kinship 
matrix in Figure 4C also exhibits more pronounced structural patterns. 

Selected methods for comparison 

We compare our algorithm HTreeQA with existing methods: TreeQA 
(Pan et al. 2008, 2009), BLOSSOC (Mailund et al. 2006; Besenbacher 
et al. 2009), EMMA (Kang et al. 2008), and HAM (Mcclurg et al. 
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Figure 6 QTL for mice daily average running distance. Only the SNPs that have top 0.5% -log(p-value) or BLOSSOC score are plotted in the 
figure. The dashed line is the significance level with FWER = 0.05. (A) Result from HTreeQA. (B) Result from TreeQA. (C) Result from EMMA. (D) 
Result from BLOSSOC. (E) Result from HAM. (F) Result from SMA. 



2006) using both real and simulated data sets. Some other methods, 
such as HapMiner (Li and Jiang 2005) and TreeLD (Zollner and 
Pritchard 2005), are too slow to process large data sets. For compar- 
ison purposes, we also implemented two other methods: SMA (single 
marker association mapping) and HAM (haplotype association map- 
ping). In SMA, each SNP marker partitions samples into groups on 
the basis of the alleles. Analysis of variance is used to evaluate the 
significance of the partition. In HAM, a sliding window of three 
consecutive SNP is used to group samples on the basis of their sequen- 
ces, and an analysis of variance is conducted to test the association 
between the phenotypes and the grouping. FastPhase (Scheet and 
Stephens 2006) is used to reconstruct haplotypes from the geno- 
types for the methods that require haplotype data (TreeQA and 
BLOSSOC). 

Note that BLOSSOC, TreeQA, and HTreeQA are phylogeny-based 
methods. SMA, HAM, and EMMA are nonphylogeny-based methods. 
Although EMMA offers an option to use global phylogeny to estimate 
the kinship matrix, it does not test the associations between the 
phenotype and the phylogenetic trees. Table 2 shows the selected 
methods for comparison. 

Performance comparison on the white head 
spot phenotype 

The white head spot is known as a recessive trait carried by WSB/EiJ 
(Aylor et al 2011). We apply the selected methods to the white 
head spot phenotype. A permutation test is applied to control the 
FWER (Westfall and Young 1993, Churchill and Doerge 1994). With 
FWER = 0.05, all the selected methods except HAM identify a QTL, 
which is approximately 100M bps in Chromosome 10 (Figure 5). This 



QTL is close to a gene named kit ligand known to be controlling white 
spotting (Aylor et al 2011). HAM fails to detect the QTL because it 
does not consider the compatibility between consecutive SNPs. The 
incompatibility between two consecutive SNPs suggests a high possi- 
bility of having a historical recombination event between them. Treat- 
ing an interval containing incompatible SNPs as a single locus may 
lead to spurious results. The phylogeny-based methods, including 
HTreeQA, can avoid this problem by only examining phylogeny trees 
constructed from compatible intervals. 

In each panel of Figure 3, A— D, the nearest common ancestor of 
the four white head spot mice (highlighted in red) is marked by a cir- 
cled "A." We observe from Figure 3, A— C that the distance between 
the common ancestor and the four mice becomes smaller when the 
interval on which the tree is built becomes shorter. It is evident that the 
four white spot mice are clustered in the phylogeny tree built over the 
20 Mb region in Figure 3C, despite the local population structure. This 
becomes clearer in Figure 3D, where the four white head spot mice 
having white head spot located on the same branch of the tristate semi- 
perfect phylogeny tree built on the compatible interval at the QTL. This 
demonstrates the effectiveness of the proposed model. 

Performance comparison on the mouse running 
distance phenotype 

We apply the selected methods on the phenotype "Mouse Running 
Distance at day 5/6." With FWER = 0.05, all the methods except SMA 
identified a QTL at 169 to 169.2 Mbp (89 cM) on Chromosome 1 as 
shown in Figure 6. The QTL falls into the previously reported cplaq3 
region (Mayeda and Hofstetter 1999). A later study also confirmed 
this QTL (Hofstetter et al 2003). 
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Figure 7 Comparison of HTreeQA, TreeQA, SSA, BLOSSOC, EMMA, and HAM under different genetic models. (A), (D), and (G) are under 
additive models; (B), (E), and (H) are under recessive models; (C), (F), and (I) are under overdominant models. 



Among the selected methods, only HTreeQA identified another 
QTL with FWER = 0.05, in the region of 16 M to 25 Mbps (8-12.5 cM) 
on Chromosome 12. The QTL falls into an unnamed QTL region at 11 
cM on Chromosome 12 reported in (Hofstetter et al 2003). The reason 
that many methods fail to report this QTL is that these methods have 
limited power in detecting non-additive effects. This result demon- 
strates that HTreeQA can detect more types of effects than the other 
methods. 

Simulation study 

To examine the performance of HTreeQA in a controlled environ- 
ment, we simulated three different types of effects: additive, recessive, 



and overdominant. For each selected method, only the SNPs with 
significance level FWER = 0.05 are reported as QTL. Because we 
remove the causative SNPs in the simulated data before we run QTL 
analysis, to measure the accuracy of the result, we considered a reported 
QTL a true positive when it was located within 50 SNPs from the 
causative SNP. We used three measurements to estimate the perfor- 
mance of each method: precision, recall, and Fl score. Precision is 
denned as the ratio between the number of true QTL that are detected 
and the total number of detected QTL. Recall is defined as the ratio 
between the number of true QTL that are detected and the total 
number of true QTL that are simulated. The Fl score is the harmonic 
mean of precision rate and recall rate, and is defined as follows: 
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Table 3 Running time comparison of the selected methods 



Methods 



Running 
Time 



Require Haplotype 
Reconstruction? 



SMA 

BLOSSOC 

HAM 

TreeQA 

EMMA 

HTreeQA 



10 min 
40 hr 
20 min 
40 hr 

3 hr 20 min 
12 min 



No 
Yes 
No 
Yes 
No 
No 



The running time is measured on a machine with Intel i7 2.67-GHz CPU and 8-G 
memory. 



Fl 



2 x Precision x Recall 
Precision + Recall 



Figure 7 compares selected methods. HTreeQA shows comparable 
performance to that of other methods in the additive model In the 
recessive model and the overdominant model, HTreeQA demon- 
strates significant advantage over other methods. Because HTreeQA 
does not have any assumption of the type of genetic effect, it offers 
consistent power for detecting any effect. Other methods except HAM 
implicitly assume the additive model. 

The phasing step required by the phylogeny-based methods 
BLOSSOC and TreeQA (for handling heterozygosity) will impair 
their ability in detecting associations between the phylogeny and 
the phenotype. The extent of its effect varies for different genetic 
models, especially with regard to heterozygous samples. It affects 
the additive model the least and overdominant model the most. For 
a homozygous sample, the nodes corresponding to the two haplotypes 
carry the same allele, and thus their phenotypes always belong to the 
same allele group. This may cause minor inflation of the QTL signals 
because the two haplotypes are treated as independent samples by 
these methods. For a heterozygous sample the two haplotypes carry 
different alleles and therefore their corresponding nodes and pheno- 
type are in two allele groups. Under the additive model assumption, 
one allele group contains all homozygous samples with high phenotype 
values, and the other contains all homozygous samples with low 
phenotype values. The heterozygous samples have medium phenotype 
values, which are added to both allele groups. This may cause minor 
deflation of the QTL signals. This is why all selected methods have 
comparable performance. TreeQA slightly outperforms others because 
its local phylogeny trees can well model the local population structure 
and separate QTL signals from genetic background. 

However, under the assumption of overdominant model, het- 
erozygous samples may have extreme phenotype values (beyond 
the range of phenotype values of the homozygous samples). These 
extreme phenotype values will always be in both allele groups; 
therefore, the phylogeny representation for phased data cannot 
explain the overdominant effects at all. This is why the traditional 
phylogeny-based methods like BLOSSOC and TreeQA fail under 
such a model. Note that HTreeQA does not require phasing. The 
tristate semi-perfect phylogeny tree has a partition that separates 
the heterozygous samples from the homozygous samples and thus 
it is able to detect an overdominant effect. Under the recessive 
model assumption, the heterozygous allele carries the same effect 
as one of the two homozygous alleles. Thus, the impact of assigning 
haplotypes of the heterozygous samples to the two allele groups is 
greater than that under the additive model and is not as great as 
that under the overdominant model. Again, this does not affect 
HTreeQA. Overall, HTreeQA has the best performance in recessive 
models and overdominant models. 



Running time comparison 

We present the running time for each selected method on a machine 
with Intel i7 2.67-GHz CPU and 8-G memory. We tested all methods 
using a dataset containing 180K SNPs and 184 individuals. Table 3 
shows the running time of these methods. If phasing is required, this 
step usually takes more than 40 hr and dominates the running time. 
HTreeQA demonstrates a great advantage by completely avoiding 
haplotype reconstruction. It is more than 600 times faster than the 
other methods that require haplotype data. HTreeQA is 15 times faster 
than EMMA because it does not need to explicitly incorporate the 
effect of global population structure as EMMA does. The running time 
of HTreeQA is comparable with that of SMA and HAM, the simplest 
models for QTL studies. They are not as effective as HTreeQA, as 
demonstrated in the real phenotype and simulation studies. 

The choice between HTreeQA, TreeQA, and EMMA 

HTreeQA is proven to have an overall lower error rate than TreeQA 
and other similar approaches (in Appendix 1). It can handle hetero- 
zygous genotype properly. It is suitable for genome-wide association 
studies on any populations, including the incipient CC lines, Hetero- 
geneous Stock, Diversity Outbred, and Recombinant Inbred Crosses 
of CC lines. TreeQA is the best choice if one focuses on the additive 
effects. EMMA can correct for global population structure but is not 
able to address any local population structure. It degenerates to a sim- 
ple linear model when applied to CC population with an evenly 
distributed global population structure as shown in Appendix 2. This 
represents a limitation of EMMA because local population structures 
exist in every mammalian resource, even though we only show the 
results on the CC population in this article. 

CONCLUSIONS 

We propose a novel approach for local phylogeny-based QTL 
mapping on genotype data without haplotype reconstruction. We 
analyze the incipient CC and show that there is no significant global 
population structure but visible local population structure. Such local 
population structure may bias the QTL mapping if it is not addressed 
properly. The notion of a tristate semi-perfect phylogeny tree is 
introduced to represent accurate genetic relationships between 
samples in short genomic regions. As a generalization of the perfect 
phylogeny tree (denned on haplotypes), a tristate semi-perfect 
phylogeny tree treats the heterozygous allele as the third state. It 
provides the power of modeling a wide range of genetic effects and 
delivers unbiased and consistent performance. It also guarantees 
a lower theoretical error rate of statistical tests than the perfect 
phylogeny based approach. This is a significant advantage over any 
previous methods that have strong bias toward an additive model. It is 
also worth noting that HTreeQA is much more computationally 
efficient than any alternative approach. 
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APPENDIX 1 



THEORETICAL ANALYSIS ON HTREEQA AND TREEQA 

In this section, we present the theoretical analysis of HTreeQA and TreeQA under different genetic models. It can be shown that HTreeQA has 
a theoretical advantage over the general phylogeny-based approach using phased haplotypes. We first prove that testing single SNPs on genotypes 
data has lower error rate than on phased haplotype data. We then analyze its potential effect on these two different phylogeny approaches. 

We assume that the causative SNP contains n x homozygous subjects, n n heterozygous subjects, and n 0 homozygous wild subjects. We also 
assume that the phenotypes can be approximated by a normal distribution, which is a reasonable assumption in most cases. We use X a , X ih , and 
X i0 to model each subject in these three groups: 



X i0 ~ N(0,<|>o) 
X ih - N(m h ,$ h ) 

Without loss of generality, we assume the samples are independent and follow three normal distributions with different means and variances 
for each group. If m n equals 0 or 1, it is a recessive model. If m n is between 0 and 1, it is an additive model. Otherwise it is an overdominance 
model. If we use a phylogeny-based approach on phased haplotypes, each homozygous subject has a duplicate homozygous subject, and each 
heterozygous subject is treated as two different homozygous subjects. Thus we could use two groups to represent the partition of this SNP, {X n , 
. . ., X Wl i, X n , . . ., X Wl i, Xi h . . ., X Hhh } and {X 10 , . . ., X„ hh , X 10 , . . ., X„ o0 > X lK . . ., X nhh }. If we use HTreeQA, which is directly applied on genotype 
data, there are three groups based on the allele of each subject, {X n , . . ., X Wl i}, {X^ . . ., X„ A /J, and {X 10 , . . ., X„ 0 o}. 

X i= 12?-iXa (A1) 

Ml 

Ro = J2i°=iX i0 (A2) 

"0 

Rh = YJiiXih (A3) 

n h 

f 2ZlLiXn + J2iiiX ih (A4) 
2«i + n n 

^_ E^i^i+Er o =^o+Erii^ (A6) 

«i + n 0 + n h 



^Haplotype ~ 2«1 (X 1 - X) 2 + tl h (X i - X) 2 

+ n h (X 0 -X) 2 + 2n 0 (X 0 -X) 2 



* Haplotype 



^Genotype ~ n\(X\ ~ X) 2 + n n (X h - X) 2 

+ n 0 (X 0 -X) 2 



n h ^ 

T Genotype (X'l — X\ ) + / . (X'l — 
i=l £f 

+ ^(X,i-X 0 ) 2 (A10) 



(A7) 



= 2Y l [Xa-xA + VXj-X, 
(=1 v ' , = 1 v ' 

"h 2 "° / _ \ 2 

+ X) (Xa - X o J + 2 X [Xio ~ X o J (A8) 



(A9) 
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Following Equation 1 in the Methods and Al to A10, we define F Hap i otype and F Genotype to represent the F-statistics of these two different 
groupings respectively. 



^Haplotype 



Fa 



'enotype ' 



^Haplotype 
^Haplotype 

_ ^Genotype 
T Genotype 



For the following analysis we assume that n l5 n ni and n 0 are large numbers, and we use 'a ~ U to denote a and b are asymptotically equal 
when the sample size approaches infinity. Here b is a number instead of a distribution. Similarly, we use and to represent asymptotically 
less than and greater than relationship respectively. Next, we prove that directly testing associations between a phenotype and the genotypes has 
a lower error rate than testing the association between the phenotypes and phased haplotypes when the sample size is large. 

First, for large sample sizes, we have the following lemmas as an immediate consequence of the Weak Law of Large Number Theorem, 

LEMMA 1 

X\ rsj IXq rsj 0 

- , 2m + m h 

X h ~ m h X i ~ — 

2«i + n n 

-, m h m+ m h n h 

A o ~ : A 



2no + n n n\ + no + 

LEMMA 2 Shaplotype ^ 2S ge/n/0 type 
Proof 

Sketch: The asymptotic values for variables in Equations A7 and A9 are determined by Lemma 1. And the expanded form of S Ha pi otype — 
2S Genotype is a quadratic function of and its discriminant is smaller than 0. 

LEMMA 3 

N random variables Y { are independent and identically distributed, with mean value fx and finite variance cp. For any real number y^/i, when 
N — > oo, we have 

P ( f> " ^) 2> f> - M-) 2 J -1- (AH) 



Proof 

Without loss of generality, we assume fx — 7 > 0, 



/ n N \ 

Vi=l ;=i / 

= p ( |> - ^ + 2 E ( y < - ^ - 7) + ^> - ^) 



z'=l i=l z=l 



> 

z=l 



P^7 f -n f x>-iV( Mj -7)/2 

y/N$ ^ 2(|> 
2<|) 

► 1 (h— > 00 ) 



!-<£(- 7 ^ ) (Central Limit Theorem) 
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LEMMA 4 T HA p LOTY p E < 2T GENOTY pe 
Proof 

X\ , X 0 and Xh converge to the mean of X a , X i0 and X ih by Lemma 1, but X i and X 0 converge to two different values as shown in Lemma 
1. Lemma 4 follows directly from Lemma 3. 

THEOREM 2 F HAPLO type ~ F genotype 
Proof 

This can be directly proved from Lemmas 2 and 4. 

We use F Nu u to represent the statistics of testing non-causative partitions from either a semi-perfect phylogeny tree or a perfect phylogeny 
tree. Because phenotype values can be approximated by a normal distribution, the distributions of F Nu n using these two approaches converge to 
the same distribution. Although it is unlike that the causative SNP is genotyped in real situation, by linkage disequilibrium, there exists a partition 
in the semi-perfect phylogeny tree or the perfect phylogeny tree based on neighboring SNPs that is very similar to the partition of the causative 
SNP. Therefore, we have the following theorem. 

THEOREM 3 P(Pnull > ^haplotype) ~ P(F/vull > F genotype) 

The probabilities in the Theorem 3 are the error rates of TreeQA on phased haplotypes and HTreeQA on genotypes. 



APPENDIX 2 



EMMA WILL DEGENERATE TO STANDARD LINEAR MODEL IN COLLABORATIVE CROSS 

First, we define a new class of matrix named K uni f orm (D y S), 







s ■ 




^uniform \P : = 


s 


D ■ 


■ s 




u 


S . 





(A12) 



where D represents the diagonal entries and S represents the off-diagonal entries in the matrix. 

Assume that y is a vector of phenotypes, X is a vector of fixed effects from a SNP, and e is a vector of residual effects for each individual. We 
omit the indicator matrix Z used in original EMMA model, because in the CC data, Z is an identity matrix. The EMMA model is presented in the 
following form: 

y=ul+X(3 + u + e (A13) 
u - MVN(0,(j 2 K K emma ) (A14) 

JUL - Norm(o,(i 2 ^ (A15) 

MVN(0,v 2 e K uniform (hO)) (A16) 



e 



where MVN represents a multivariate normal distribution. K emma is the kinship matrix inferred by the EMMA package. 
Similarly, a standard linear model is in the following form: 

y = juul + Xp + e (A17) 

JUL - Norm (0,o-jQ (A18) 

e ~ MVN(0,aX^(l,0)) (A19) 

Assuming the samples of a population have exactly the same relatedness S: 

K U niform(li S) = K un if orm (S, S) + K un ^ orm (l — S, 0) (A20) 

M-l ~ MVN(0,a^„ m/oral (l,l)) (A21) 

e ~ MVN^^K^p^O)) (A22) 
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Thus, if K emma = K uni f orm (l, S), by re- factorization of the random effects in the EMMA model, we have 

y = juul + XP + e (A23) 

ul ~ MVN(0, K uniform (<il + oiS, o£ + a|S)) (A24) 
e ~ MVN(0,cr 2 e K unifonn ((l - a|)S + 1,0)) (A25) 

This has the same form of a standard linear regression model. In CC, the kinship matrix can be represented by a K uni f orm matrix with tolerable 
numerical error. This suggests that there is no significant difference between EMMA and the standard linear regression model when these two 
methods are applied to Collaborative Cross data. 
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